Term Count on Search Results #3920

nik9000 · 2013-10-16T13:59:58Z

Would anyone else be interested in getting Elasticsearch to return a count of the terms in a field in the search results? If you (like me) need to return a word count of a field then this could be useful to you. I could also return a count of distinct terms, but I'm not sure who would use it. I was thinking the API could be something like this:

curl -XPOST "http://localhost:9200/test/test/_search?pretty" -d '{
  "fields": [ "foo._term_count" ],
  "query": {
    "query_string": {
      "query": "findme"
    }
  }
}'

And it'd return "foo._term_count" : 6, in the results.

It'd require term_vectors to be stored but not offsets or positions. Since it'd count the terms on each search result it'd be similar to highlighting using the FVH but faster because it does essentially no work other than the term vector scanning.
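
A minimal Python sketch of the counting step, assuming a stored term vector can be viewed as a mapping from each term to its frequency in one field of one document. The `term_vector` dict and both function names are hypothetical and not part of any Elasticsearch or Lucene API. The sketch only illustrates that frequencies alone are enough, so offsets and positions are not needed.

```python
def term_count(term_vector):
    """Total number of term occurrences: the sum of the per-term frequencies."""
    return sum(term_vector.values())


def distinct_term_count(term_vector):
    """Number of distinct terms: one entry per term in the vector."""
    return len(term_vector)


# Hypothetical term vector for the text "to be or not to be".
term_vector = {"to": 2, "be": 2, "or": 1, "not": 1}

print(term_count(term_vector))           # 6
print(distinct_term_count(term_vector))  # 4
```

Both values come from a single pass over the vector, which is why this should be cheap compared with highlighting.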

I don't imagine you'd be able to sort by them.


brwe · 2013-10-17T11:29:29Z

Might this requirement be similar to #3924? Also, I am curious: what is your use case?

nik9000 · 2013-10-17T12:47:02Z

Sorry I wasn't clear. On my search results page I have to return a word count of one of the fields for every search result. It happens to be my longest field. And it has to support scriptio continua languages so I can't do something simple like count the number of spaces in my app and save that number to ES to retrieve with the search results. Anyway, Elasticsearch has a word count already in the form of the per field per document term vectors that I already store to use the FVH. Also luckily I process that field with an analyzer that doesn't add synonyms or funky word breaks. If I can ask Elasticsearch to count the terms in that field that'll give me my word count.
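
To illustrate why counting spaces in the application is not enough, here is a small Python sketch. The sample strings and the function name are made up for illustration; the Chinese sentence is written without spaces, as is normal for that script.

```python
def count_by_whitespace(text):
    """Naive word count: split on runs of whitespace and count the pieces."""
    return len(text.split())


english = "the quick brown fox"
chinese = "我喜欢全文搜索"  # roughly "I like full-text search", no spaces

print(count_by_whitespace(english))  # 4
print(count_by_whitespace(chinese))  # 1, although the sentence has several words
```

An analyzer that knows how to segment such text already ran at index time, which is why reusing its output is attractive.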

Anyway, what I need is pretty simple in comparison to the term vector API. I won't be listing terms and I only want term information for a single document. I also want it bundled in the search results so I don't have to make any additional requests.

I'll send a pull request that implements this today or tomorrow, which should make it crystal clear.

s1monw · 2013-10-17T12:54:11Z

Just for kicks, can you build a custom analyzer that consumes all tokens and returns the number of tokens in the field as a token, and then sort by it? You would need to parse the string, but it would work, no?

synhershko · 2013-10-17T13:03:17Z

+1. Use cases can include faceting, scripted scoring, record linkage and whatnot.

@s1monw all that is required is a custom TokenFilter really, but that filter doesn't have access to the IndexWriter (IW) / Document object, so you would need to go through the analysis chain twice.
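
The idea in the two comments above can be sketched in plain Python. This is a conceptual illustration only, not Lucene's TokenFilter API: `tokenize`, `count_filter`, `index_document`, and the field names are all hypothetical, and the whitespace tokenizer stands in for whatever analyzer the field really uses.

```python
def tokenize(text):
    """Stand-in analyzer (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def count_filter(tokens):
    """Consume every incoming token and emit a single token holding the count."""
    count = 0
    for _ in tokens:
        count += 1
    yield str(count)


def index_document(text):
    # The text is analyzed twice: once for the normal field and once for the
    # count, because the counting filter swallows the tokens it reads.
    count_token = next(count_filter(tokenize(text)))
    return {
        "foo": tokenize(text),
        "foo_term_count": int(count_token),  # the count arrives as a string
    }


doc = index_document("Find me in this short text")
print(doc["foo_term_count"])  # 6
```

The `int(...)` call corresponds to the "parse the string" step, and the second `tokenize` call is the cost of running the analysis chain twice.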

nik9000 · 2013-10-17T18:09:28Z

> record linkage

Sorry, what do you mean?

> custom TokenFilter

I like this idea. In that case it'd make sense to build the field in the mapping, maybe like this:

curl -XPUT "http://localhost:9200/test/test/_mapping?pretty" -d '{
  "test" : {
    "properties": {
      "foo" : {
        "type": "string",
        "store": "yes",
        "write_term_count" : "foo_term_count"
      },
      "foo_term_count" : {
        "type": "integer",
        "store": "yes"
      }
    }
  }
}'

It'd be a pain to have to use the custom analyzer and analyze everything twice but it'd be worth it if it enables lots of fun features. I'll have a look later today I think.

synhershko · 2013-10-17T20:58:26Z

Record linkage is whenever you want to find similar documents, and word count can be a good hint for that.

javanna · 2013-10-18T15:13:16Z

This other issue looks similar to what was asked here, although it proposes a separate API for it: #640.

synhershko · 2013-10-19T16:37:45Z

I think someone is confusing Word-Count in a field of a specific document with Term Count of all documents in a field. Not sure who that is, though :)

javanna · 2013-10-19T17:27:54Z

Indeed, that other issue is a completely different story. I should have read more carefully. Thanks for clarifying that, @synhershko.

nik9000 · 2013-10-21T21:54:31Z

I got this working today. I'll send a pull request for it as soon as it passes all of its tests. GitHub has helpfully created a link to my implementation above for anyone curious. The unit test covers returning the count in the search results, searching for it via a range query, and faceting. It covers counting both single and multi-valued fields, both on the root and inside of an object. For multi-valued fields it writes multiple term counts, one per value; it doesn't sum them.
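
A small Python sketch of the multi-valued behaviour described above. It is not the code from the linked implementation: `analyze`, `term_counts`, and `matches_range` are hypothetical names, the whitespace split stands in for the real analyzer, and the range check assumes that a multi-valued numeric field matches when any one of its values falls in the range.

```python
def analyze(value):
    """Stand-in analyzer (assumption): split on whitespace."""
    return value.split()


def term_counts(field_value):
    """Return one term count per value; counts are not summed."""
    values = field_value if isinstance(field_value, list) else [field_value]
    return [len(analyze(v)) for v in values]


def matches_range(counts, gte, lte):
    """Assumed semantics: match if any single count lies within [gte, lte]."""
    return any(gte <= c <= lte for c in counts)


single = term_counts("one two three")
multi = term_counts(["one two", "three four five"])

print(single)                       # [3]
print(multi)                        # [2, 3]
print(matches_range(multi, 3, 3))   # True
print(matches_range(multi, 5, 10))  # False, because 2 and 3 are never added up
```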

nik9000 · 2013-10-21T22:09:02Z

Also, while I think about it I'm pretty sure I did a few things wrong and would love some tips on the right way:

1. I created a new package for the implementation. I'm sure there is some obvious place where it belongs.
2. I'm a bit hacky in the way that I override the Long field implementation and in the way I use that field regardless of what field type the term counts are actually mapped to. Now that I think about it, I didn't do a lot of testing around mapping the term count to things other than long. I mean, it won't work at all if you map it to something non-numeric, but short and int and the like should work as expected.
3. Everything else I haven't thought of :)

brwe mentioned this issue Oct 17, 2013

Feature Request: pre-select terms in TermVector request #3924

Closed

nik9000 mentioned this issue Oct 22, 2013

Allow string fields to store term counts #3945

Closed

ghost assigned jpountz Nov 19, 2013

jpountz closed this as completed Dec 3, 2013

mhkuu mentioned this issue Oct 20, 2015

Add additional linguistic information to saved queries UUDigitalHumanitieslab/texcavator#74

Open
